-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EXP] minhash unique covered bp #2027
base: latest
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## latest #2027 +/- ##
==========================================
+ Coverage 84.15% 91.66% +7.51%
==========================================
Files 129 98 -31
Lines 15087 10807 -4280
Branches 2119 2119
==========================================
- Hits 12696 9906 -2790
+ Misses 2095 604 -1491
- Partials 296 297 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
@ctb I think I need your thoughts on this.
I think The issue is with For the test, I think this is causing the test |
@bluegenes Don't know if this helps your test pass, and/or if I had a typo in my previous email, but since the number of k-mers is L-k+1, if you want to recover L (the number of unique k-mers), you should take num_kmers + k -1. The +1 -> -1 may affect things (and apologies if I led you astray!) |
@dkoslicki excellent, thanks -- @ctb and I puzzled over that yesterday, and then I just left it as will fix - thanks! |
I had trouble thinking this through. So I wrote some code: import sourmash
from sourmash.sourmash_args import SaveSignaturesToLocation
# sourmash sketch dna tests/test-data/short.fa -p scaled=1,k=21 -o short.fa.sig
ss = sourmash.load_one_signature('short.fa.sig')
# pull 'short.fa.sig' into individual k-mers => signatures
siglist = []
for n, hashval in enumerate(ss.minhash.hashes):
mh = ss.minhash.copy_and_clear()
mh.add_hash(hashval)
subsig = sourmash.SourmashSignature(mh, name=f"hash{n}")
siglist.append(subsig)
with SaveSignaturesToLocation('short.hashes.zip') as savesig:
for sig in siglist:
savesig.add(sig)
# then:
# sourmash gather short.fa.sig short.hashes.zip -o xxx.csv --threshold-bp=0 the idea I have is that our gather results should work for scaled=1, and that can maybe guide our results for scaled=1000. note, here I don't know if this helps think through the numbers but it at least makes sense to me :). I'll revisit later... |
more reasoned take number 1: (1) gather fundamentally works on hashes/k-mers, so we should recognize that it is not providing bp estimates and adjust internal comments/code and external documentation appropriately. For now, just make another issue and leave the gather outputs as they were before this PR. take number 2, if you don't like number 1: (2) gather is an iterative algorithm, so we should only to add the k - 1 once - presumably at the very beginning. |
In ANI work, I dropped our rough
scaled
*num_hashes
estimate for bp intominhash.py
as the propertycovered_bp
. Thanks to @dkoslicki for pointing out that 1. this is a bad name, bc we're not taking abundances into account, and 2. we should really be adding (ksize -1).In this PR, I'm updating the property and renaming to
unique_covered_bp
We could also add
covered_bp
that does take into account abundances if desired...